ERRATIC LEARNING CURVE

John Ludbrook 

SUMMARY

1. Learning how to apply statistical analyses to the results of 
experimental or clinical studies may take a lifetime of trial (and 
sometimes error), as it has done in the author's case. There is
no evidence that biomedical investigators of the present gener-
ation are on a steeper learning curve. Gross misunderstandings
of the purpose and functions of statistical analysis are apparent 
in applications to research grant-giving bodies and ethics com- 
mittees, in manuscripts submitted to journals and sometimes in 
published papers. 

2. Although estimation of minimal group (sample) size for a 
given power is an essential step in planning clinical studies, it
seems to be used rarely in laboratory experimental work. This
is despite exhortations to restrict the number of animals used
to a minimum. 

3. Most investigators use hypothesis testing to analyse their
results, but their understanding of the meaning of the resultant
P-values is slight.

4. A flaw found almost universally in biomedical manuscripts 
is to make multiple inferences from the results of a single study. 
The goal of statistical analysis is to maintain the familywise type 
I error rate (risk of false-positive inference) at a predetermined
level (usually 5%). But, when multiple inferences are made from 
the same experiment, the risk of false-positive error is inflated. 
There are two solutions to this problem: (i) use a multiple com- 
parison procedure to control the familywise type I error rate; 
and (ii) test a single, global hypothesis.

5. Biomedical investigators have been quick to acquire com- 
puter statistics software and to use it to analyse their experi- 
ments. However, they have been slow to recognize the limitations 
of this software. These include: (i) inadequate documentation 
of routines, so that neither the user nor the reader of published 
papers can be sure how the tests have been executed; (ii) flawed
algorithms for the execution of statistical procedures; and (iii) 
failure to recognize that the best software for their purposes is
that which takes them just beyond their statistical horizons.

6. The obvious solution to these difficulties is to recruit a
biomedical statistician into every research group, at a relatively 
trivial cost. However, properly qualified biostatisticians are in
desperately short supply in Australia. It follows that research
groups, national grant-giving agencies and academic institutions
must make provision for the proper training and subsequent
employment of biostatisticians.

Key words: biostatisticians, global hypotheses, minimal group
size, multiple comparison procedures, multiple inferences, 
power, P-values, Ryan-Holm, statistics software, type I error 
rate. 



INTRODUCTION 

Over the past 50 years, the author has progressed from executing 
χ²-tests in 1951 to complex analyses of variance with as many as
13 independent terms in 2000. This has not been a linear, or a pain-
less, progression. For instance, in 1965, I presented my first paper
before the Australian Physiological and Pharmacological Society 
(APPS). A distinguished member of APPS, who was both a physi- 
ologist and a mathematician, reduced me to incoherence by point- 
ing out (quite correctly) the distinction between independent and 
related observations. Thanks to his comments, I was able to correct
this error in the final publication. 

The first exhortation to physiologists to analyse their results 
statistically dates back to a massive review by Dunn in 1929, which 
contained no fewer than 694 references. 2 This was almost premature, 
for, at that time, the techniques available consisted of calculating
means and standard deviations, Pearson's product-moment correl-
ation coefficient, Pearson's χ² test and Student's t-test (not long after
the t distribution was converted to a practical test of significance by
Fisher in 1925 3 ). Dunn did not mention analysis of variance, 
although Fisher had at least hinted at it by 1923. It is worth reciting 
another piece of history. Virtually all the statistical procedures used 
today were described before the middle of the 20th century: that is, 
in the precomputer era. Almost the only exceptions to this rule are 
the technique of bootstrapping, which is heavily computer dependent,
and some of the forms of survival analysis used in clinical trials. 
Thus, the lead time from description of a new technique to its enter- 
ing into the consciousness and practice of biomedical investigators 
seems to be approximately 50 years. 

In what follows, I shall attempt to describe what I see as some 
of the statistical difficulties facing biomedical investigators at the 
beginning of the third millennium. Despite the Gregorian origin of
our calendar, there will be nothing especially Christian in my
comments. They refer to matters such as the estimation of group
(sample) size in advance of studies, the problem of multiple infer- 
ences from a single experiment, the dangers of embracing computer 
statistical software uncritically and the unfulfilled need for bio- 
statisticians to be members of research teams. 

ESTIMATION OF MINIMAL GROUP SIZE AND 
POWER 

There is an enormous literature on the importance of adequately, yet
not excessively, sized samples in clinical studies, especially those in
which the effects of an intervention are compared with a placebo 
control or in which the effects of a new intervention are compared 
with those of conventional treatment, 4,5 as in randomized, prospec-
tive, controlled clinical trials. Group (sample) sizes must be suffici-
ent to maintain the type II error rate, or β (the risk of false-negative
inference), at an acceptable level, usually 10-20%. It is more usual
to speak of the power to reject the null hypothesis, or (1 - β), for
which an acceptable level is 80-90%. The preoccupation with 
adequacy of power and, therefore, of group size stems from the 
argument that it would be wrong, statistically and ethically, to con- 
clude that there is no difference between treatments when, in fact, 
there is. At the same time, however, it is important not to recruit too 
many subjects into a trial, both on ethical grounds and because this 
would delay a change in clinical practice. 

It is curious that, although the above considerations of group size
and power are an important and universally accepted (indeed, uni-
versally required) element of clinical studies, they have never become
part of the ethos of laboratory experimentation. One of the very few
who have argued that it should be is Ian McCance, 6 yet his argu- 
ments seem to have fallen on deaf ears. Institutional animal ethics 
committees (AEC) are exhorted (although not very strongly) to 
address this matter, 7 but they rarely seem to do so with any enthusi- 
asm. In addition, reviewers of manuscripts for journals that deal with 
animal experiments raise matters of group size and power only if 
the authors fail to confirm earlier reports of the occurrence of a given 
phenomenon. McCance was undoubtedly a prophet before his time, 
because it seems inevitable and, indeed, highly desirable that animal 
experimenters should not use more animals than necessary to 
achieve, say, 80% power in their experiment; and neither should they 
use so few that a negative finding can be attributed to inadequate 
group size and power. The task of estimating minimal group sizes 
and power has been made easy by sets of tables 8 and by software 
such as nQuery Advisor (Statistical Solutions, Boston, MA, USA).
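
As a rough illustration of the kind of calculation that such tables and software perform, the following Python sketch estimates the minimal group size for comparing the means of two independent groups. It uses the normal approximation n = 2(z(1 - α/2) + z(1 - β))² / d², where d is the difference to be detected divided by the common standard deviation. The function name, its parameters and the example effect sizes are illustrative assumptions and are not taken from any package named above; exact methods based on the noncentral t distribution give slightly larger groups.

```python
from math import ceil
from statistics import NormalDist


def min_group_size(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-group comparison of means.

    effect_size is the difference to be detected divided by the common
    standard deviation (a value the investigator must supply in advance).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error rate
    z_beta = z.inv_cdf(power)           # power = 1 - beta
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    return ceil(n)                      # round up to whole animals or subjects


if __name__ == "__main__":
    # Hypothetical standardized effect sizes, chosen only for illustration.
    for d in (0.5, 1.0, 1.5):
        print(f"effect size {d}: {min_group_size(d)} per group")
```

The group size rises steeply as the effect to be detected shrinks, which is why the expected effect must be stated before the experiment rather than after it.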

THE MEANING OF P-VALUES

Most of us use hypothesis testing (P-values) as the basis for
analysing the results of experiments. There is nothing wrong with 
this, although it is worth noting that there has been a putsch, especi- 
ally from a British school of statistics, to use estimation (confidence 
intervals; CI) instead. 9 Few physiologists and pharmacologists have 
been pushed to use CI and there is no good reason why they should. 
That is, there is no strong evidence that investigators or readers find 
CI more informative than P-values. 10 

But what does a P-value mean? To statisticians, it can mean any 
or all of many things, but the simplest definition is that it corresponds 
to the probability of type I error, or α (the risk of a false-positive
inference). More profound definitions depend on whether inferences
are made under the population (Neyman-Pearson) model, which 
implies random sampling of defined populations, or under the ran- 
domization model, which implies taking non-random samples and 
randomizing the members to one or another set of conditions or treat- 
ments. The model of inference is somewhat incidental, except that 
statisticians tend to assume that experimenters take random samples 
(a very rare occurrence), whereas experimenters assume that statis-
ticians know that they do not. 11

There is one other important matter to do with P. Another way 
of defining it is the probability of the observed (or more extreme) 
data when the null hypothesis, $H_0$, is true. If one takes a t-test
for equality of the means of two groups (strictly, two populations),
coded 1 and 2, then $H_0$ can be defined as:

$$H_0: \bar{x}_2 = \bar{x}_1$$

However, there is another consideration. If $H_0$ is rejected, an alter-
native hypothesis ($H_1$) must be accepted. But $H_1$ can take one of
two forms:

One-sided (specific) $H_1: \bar{x}_2 > \bar{x}_1$

Two-sided (non-specific) $H_1: \bar{x}_2 \neq \bar{x}_1$

In almost every circumstance in biomedical research (and in all cir-
cumstances in clinical research) the two-sided, non-specific, form
of $H_1$ should be preferred. Why? Because (at any rate in clinical
research) one does not and cannot know in advance that one treatment
or state can never result in an outcome that is worse than the other
(as a specific $H_1$ presumes). In clinical studies, it would be unethical
to presume that this is so. In laboratory experiments, the same argument applies.

Thus, P should always be constructed in a two-sided non-specific 
fashion. Regrettably, investigators who are searching for 'signifi-
cant' outcomes from their experiments sometimes select the one-
sided outcome of a test because the P-value is (usually) half that of
the two-sided outcome. It is a matter of opinion whether this should 
be regarded as scientific fraud or merely a form of 'data torturing'. 12 

There is one last point to make under this heading. Biomedical 
investigators have the habit of supporting their conclusions by state-
ments such as 'there was a significant difference' (having stipulated
earlier that they mean P ≤ 0.05), or merely by stating 'P < 0.05'.
There are two views about this convention. One is that if P = 0.05
has been defined as the watershed between 'significant' and 'not sig-
nificant', then it is sufficient to indicate this decision. The other view
is that it is important to indicate the 'strength of evidence'; that is,
the actual P-value attached to the null hypothesis. In the old days,
when P-values were obtained from published tables, actual P-values
were impossible to obtain. However, nowadays, when electronic tables are
available as stand-alone computer programs or within computer
statistics programs, actual P-values can be obtained. I am in no doubt
that actual P-values should be given. My arguments are simple.
A mere P < 0.05 does not distinguish between P = 0.04999 and
P = 0.000004999, and actual values of P are essential if multiple
comparisons are to be made (vide infra).

MAKING MULTIPLE INFERENCES FROM A 
SINGLE EXPERIMENT 

This topic causes great anguish to biomedical investigators. When
statisticians or reviewers of manuscripts draw their attention to the
matter, their reaction is more often than not frank disbelief that it
could be of any importance. So, I shall try to point out that it does 
matter, why it matters and what can be done about it. 

Comparison-wise versus familywise (experiment-wise)
type I error rates 

The type I error rate conventionally nominated by investigators is
5%. The goal of any statistical analysis of experimental results is
to avoid false-positive statistical inferences; more precisely, to keep
the type I error rate from exceeding that nominated by the
investigators. Such an excess is bound to occur if more than one infer-
ence is made from the results of a single experiment. Or, to put it
crudely, if enough null hypotheses are tested and inferences made,
then it is inevitable that a 'significant' P-value will turn up.
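
The size of this inflation is easy to work out in the simplest case. The sketch below assumes that the k hypotheses tested are independent and that every null hypothesis is true, so that the chance of at least one false-positive inference is 1 - (1 - α)^k. The function name is illustrative, and correlated hypotheses will give somewhat different figures.

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive among k independent tests,
    each carried out at the comparison-wise level alpha."""
    return 1 - (1 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 2, 5, 10, 20):
        print(f"{k:2d} tests: familywise rate {familywise_error_rate(k):.3f}")
```

With 10 independent tests at the 5% level, the chance of at least one spurious 'significant' result is already about 40%.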

There are two approaches to controlling the type I error rate when 
multiple statistical inferences are made. One is to control the 
comparison-wise type I error rate. This implies that if any given com- 
parison and consequent inference is made, then if the same 
comparison and inference was made a great many times, the fre- 
quency of false-positive inferences from that comparison would not 
exceed 5% in the long run. However, control of the familywise 
(experiment-wise) type I error rate implies that if the whole experi-
ment actually performed were replicated a great many times, the fre-
quency of any false-positive inference from the experiment would
not exceed 5%. Or, in simple terms, are all the inferences made from 
the experiment as a whole able to be replicated? 

Those in the trade of biostatistics are in no doubt about which of 
the two error rates should be controlled: the familywise (experiment- 
wise). And, surely, biomedical investigators should hope that if 
others were to repeat their experiment, they would reach the same 
conclusions? 

What is a family of hypotheses? 

Although statisticians are nearly unanimous in stating that the 
familywise type I error rate should be controlled, they are remark- 
ably reticent about providing a definition of a family of hypotheses. 
One of the better definitions is that by Miller. 13 He said: 'There
are no hard-and-fast rules for where the family lines should be
drawn ...'. But, then, more helpfully: 'The natural family for the
author in the majority of instances is the individual experiment of 
a single researcher' (his italics). My own rules are as follows. 

1. Usually, the family consists of all comparisons/inferences made
from the results of a single experiment. This is not quite the same 
thing as saying that a family consists of all inferences drawn in a 
single published paper. For instance, a published paper may 
describe preliminary experiments in animals, followed by similar 
studies in humans. Surely it would be wrong to pool the two when 
it comes to defining a family of hypotheses? 

2. I have some sympathy for a minimalist approach, in which all 
inferences made from the information provided in a single table or 
figure should be regarded as belonging to a family. 

How to control the familywise type I error rate 

The best-known approach is to adjust the P-values resulting from
hypothesis (significance) testing. Some procedures make this 
adjustment under the assumption (explicit or, more often, implicit) 
that the comparison-wise type I error rate should be controlled. These
include Fisher's restricted least significant difference (LSD), the
Student-Newman-Keuls (SNK) procedure and the Duncan multiple
range procedure. These are quite popular among biomedical investi-
gators, in part because they are provided by computer statistics pack-
ages (and, in part, one fears, because they are lenient).

There are many procedures that control the familywise type I error 
rate. The problem is that some are altogether too conservative. The 
best-known is the Bonferroni procedure, brought to biomedical
investigators' attention by the classical paper by Wallenstein et al. 14
This entails multiplying the raw P-values that result from testing
the several (k) hypotheses by the number of hypotheses; that is:

Adjusted $P' = kP$

Šidák's rather more subtle version of this is that:

$P' = 1 - (1 - P)^k$

Both these procedures provide complete protection against exces-
sive type I error. However, in both cases, it is assumed that all
multiple hypotheses are independent of each other. This is very rarely
the case. And, if hypotheses are correlated, the Bonferroni and Šidák
procedures are too harsh.
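
The two adjustments can be sketched in a few lines of Python. The function names and the example P-values are illustrative assumptions. Capping the Bonferroni result at 1 is a common convention that is not part of the formula as stated above.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: P' = k * P, capped at 1 by convention."""
    k = len(p_values)
    return [min(1.0, k * p) for p in p_values]


def sidak(p_values):
    """Sidak adjustment: P' = 1 - (1 - P) ** k."""
    k = len(p_values)
    return [1 - (1 - p) ** k for p in p_values]


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.50]  # hypothetical raw P-values from one family
    for p, b, s in zip(raw, bonferroni(raw), sidak(raw)):
        print(f"raw {p:.3f}  Bonferroni {b:.4f}  Sidak {s:.4f}")
```

The Šidák-adjusted value is never larger than the Bonferroni-adjusted one, but for small P-values the two are almost identical.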

Then there is a set of procedures that is designed specifically for 
data that have been measured on an interval scale. 14 These include 
the Tukey-Kramer procedure, for all possible pairwise contrasts 
among groups, the Dunnett procedure, for all pairwise contrasts of 
a control group against all others, and the Scheffé procedure, for all
possible contrasts among groups, pairwise or other. All three pro-
cedures have been shown, by Monte Carlo simulation studies, to
provide complete control over the familywise type I error rate in
the particular circumstances. In addition, all are available as sub-
routines in the more popular computer statistics packages. Their
virtues are that they allow for both independent and related 
hypotheses. Their defect is that they can be applied only to data mea- 
sured on an interval (continuous) scale. 

Recently (i.e. within the past 50 years), another procedure has 
been described that caters for data measured on any scale (continu- 
ous, ordinal, categorical). All that is required is the actual raw 
P-values attached to the several hypotheses/inferences that have been 
made within a given family. It has been shown by Monte Carlo 
simulations that it affords complete protection against excessive 
familywise type I error and that, at the same time, it caters for both
independent and related hypotheses and it is very powerful. This is
the Ryan-Holm stepdown Bonferroni procedure. It is remarkably
simple to execute with pencil and paper or a hand-held calculator.
It can be used to adjust P-values or CI. 15,16
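
To show how little arithmetic is involved, here is a minimal Python sketch of the step-down adjustment as it is usually described under Holm's name. The raw P-values are ranked from smallest to largest; the smallest is multiplied by k, the next by k - 1, and so on; and each adjusted value is prevented from falling below the one before it. The function name and the example values are illustrative assumptions, and the sketch covers only the adjustment of P-values, not of CI.

```python
def holm_adjust(p_values):
    """Step-down (Holm) adjusted P-values, returned in the original order."""
    k = len(p_values)
    # Indices of the raw P-values, from smallest to largest.
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_max = 0.0
    for step, idx in enumerate(order):
        # Multiplier shrinks from k down to 1 as we step through the ranks.
        candidate = min(1.0, (k - step) * p_values[idx])
        # Enforce monotonicity: an adjusted P cannot be smaller than
        # the adjusted P of a hypothesis with a smaller raw P.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw P-values
    for p, adj in zip(raw, holm_adjust(raw)):
        print(f"raw {p:.3f}  adjusted {adj:.3f}")
```

Only the smallest raw P-value receives the full Bonferroni multiplier, which is why the procedure is less harsh than the single-step Bonferroni adjustment.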

The global approach 

There is another solution to the problem of multiple inferences that 
can sometimes be used. It is to use multivariate techniques to arrive 
at a single P-value for a global hypothesis (Table 1). These tech-
niques are often complex, usually require the use of a computer
statistics program and are potentially dangerous in the hands of those 
who are unfamiliar with the theoretical basis for the procedures or 
with the particular statistics program that is used to execute them. 
They also often require an expert to interpret the outcome and care- 
ful explanation to readers of the published paper. Nevertheless, the 
global approach can often be used to circumvent the problem of test- 
ing multiple hypotheses and making multiple inferences. 

Table 1 Some global alternatives to multiple comparisons

Procedure                                                          Dependent variable(s)              Independent variable(s)
-----------------------------------------------------------------  ---------------------------------  ----------------------------------
One-, two- or multi-way ANOVA 20                                   One continuous                     One or more categorical
Multivariate ANOVA 21                                              Two or more continuous             One or more categorical
Stepwise multiple linear regression                                One continuous                     Multiple continuous
Repeated-measures ANOVA 22                                         Three or more repeated continuous  Two or more categorical
Cochran-Armitage 2 × c table 23                                    Two rows                           Multiple ordered columns
Homogeneity of OR for stratified 2 × 2 tables 23                   Multiple 2 × 2 tables
G² statistic for stratified r × c tables (log-linear modelling) 23 Multiple r × c tables
Stepwise binomial logistic regression 24                           Single binomial                    Multiple continuous or categorical
Cox proportional-hazards regression 25                             Single binomial                    Multiple categorical

OR, odds ratio.




DANGERS OF UNCRITICAL USE OF 
STATISTICS SOFTWARE 

I shall introduce this topic with an example. A few years ago, a Swiss 
pharmacologist and I met over the internet. He was puzzled that 
when he used the Wilcoxon-Mann-Whitney (WMW) rank-order test
to analyse his experimental data, he arrived at two very different 
outcomes from two different statistics software packages. We ended 
up by publishing the results of analysing these same data with no
fewer than 11 commercial statistics packages. 17 The range of out-
comes was from P = 0.0885 to P = 0.0147. That is, in conventional
terms, the outcomes ranged from decidedly 'not significant' to
decidedly 'significant'. The WMW procedure can be executed in at
least five different ways. These are with a correction for ties, with 
a correction for continuity, with neither and with both. These 
four are based on asymptotic (large sample) approximations to the 
normal or Chi-squared distributions. The fifth way is by exact 
permutation. The first four variants of the WMW test can be executed 
by hand and we did this. Our concern was not so much with the 
wide range of P-values, as with the fact that so many of the statis-
tics packages failed to describe, in their manuals or in their help files,
precisely which variant(s) of the WMW procedure they executed. 
There was also an odd-man-out package, the outcome from 
which coincided with none of the above variants and was clearly 
due to an algorithmic error (later admitted by the vendor). There 
was another difficulty, which was that many of the packages sug- 
gested that the P-values resulting from the WMW procedure refer 
to the null hypothesis of equal group medians. This is simply 
not so. 17 
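
The five variants are simple enough to sketch directly. The Python code below is an illustrative implementation written for this purpose, not the algorithm of any commercial package. It computes two-sided P-values from the rank sum of the first group, using the normal approximation with and without the corrections for ties and for continuity, and by exact permutation of the midranks. The function names and dictionary keys are assumptions, and the exact method enumerates every possible allocation, so it is practical only for small groups.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def midranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wmw_p_values(x, y):
    """Two-sided WMW P-values under five different conventions."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])                 # rank sum of the first group
    mean_w = n1 * (n + 1) / 2           # its expectation under H0
    dev = abs(w - mean_w)
    var_plain = n1 * n2 * (n + 1) / 12
    # Each distinct midrank occurs once per member of its tied group.
    tie_sizes = [ranks.count(r) for r in set(ranks)]
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    var_ties = n1 * n2 * ((n + 1) - tie_term) / 12

    def asymptotic(variance, continuity):
        shift = 0.5 if continuity else 0.0
        z = max(dev - shift, 0.0) / sqrt(variance)
        return 2 * (1 - NormalDist().cdf(z))

    # Exact permutation: every way of allocating n1 of the n midranks.
    total = extreme = 0
    for subset in combinations(ranks, n1):
        total += 1
        if abs(sum(subset) - mean_w) >= dev - 1e-9:
            extreme += 1

    return {
        "neither": asymptotic(var_plain, False),
        "ties": asymptotic(var_ties, False),
        "continuity": asymptotic(var_plain, True),
        "both": asymptotic(var_ties, True),
        "exact": extreme / total,
    }


if __name__ == "__main__":
    # Hypothetical data: two groups of three with no overlap.
    for name, p in wmw_p_values([1, 2, 3], [4, 5, 6]).items():
        print(f"{name:10s} P = {p:.4f}")
```

Even with this tiny hypothetical data set the uncorrected approximation falls just below 0.05, whereas the exact permutation P-value is 0.1. A package that does not say which variant it executes therefore leaves the user unable to interpret the outcome.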

I did not find these results surprising, because I had had similar 
experiences with other statistical procedures and with other com- 
puter statistics packages. However, very few biomedical investi-
gators appreciate this problem. In view of this, I offer the following
pieces of advice in the form of questions you should ask before 
purchasing a computer statistics software package: 

1. Does it provide printed, comprehensive manuals? 

2. Are the statistical procedures fully documented? That is, does 
the manual give chapter-and-verse references to original articles in
the statistical literature? 

3. Does it provide internet access to professional advisors on the 
statistical routines and their possible limitations? 

4. Before purchasing the software, have you consulted with 
colleagues who use it? And, have you taken along your own data 
sets and asked them to analyse these (and watched while they 
do it)? 



5. Have you thoroughly investigated where you can get the best 
price? Currently, I find that the best price (and the best advice about
what a given package will do) is often obtained offshore via the 
internet. 

There are two further pieces of advice. The first is prescriptive,
the second advisory. 

6. Never purchase the latest version of a statistics package when 
it is released. Wait until the inevitable 'bugs' have been corrected. 

7. If you are starting off, go for the statistics package that provides 
more than you think you need. That is, don't go for the simplest. 
Go for a package that promises to extend the range of procedures 
that you currently use. Top-of-the-range statistical programming
packages include SAS (SAS Institute, Cary, NC, USA) and S-PLUS
(MathSoft Inc., Seattle, WA, USA). These are really designed for
professional statisticians. Next down the line are packages that pro- 
vide informative manuals, online internet help and a wide variety 
of statistical procedures. These include SPSS (SPSS Inc., Chicago, 
IL, USA), SYSTAT (SPSS Inc.) and, perhaps, Statistica (StatSoft Inc.,
Tulsa, OK, USA). My advice is not to descend to the third level of 
elementary, 'user-friendly' and relatively cheap programs (includ- 
ing spreadsheets). 

BIOSTATISTICIANS 

The most obvious solution to the biostatistical difficulties that have 
been described above is to ensure that every group of biomedical 
investigators, large or small, has access to professional biostatistical
advice. The emphasis is on 'biostatistical', not merely 'statistical',
because investigators complain bitterly that if they approach a 
statistical consulting service they find that a great deal of their time 
(and money) is spent in explaining the biological problem to the 
statistician. A review of the 1998-2000 Annual Reports of five large 
biomedical research institutes in Australia shows that the average 
cost of production of each publication is approximately $140 000. 
Diversion of even 1% of this should provide an excellent biostatis- 
tical consulting service. The same argument, although on a different 
scale, applies to smaller research groups. 

However, if biomedical investigators have the will to invest in 
biostatistical expertise, there is a very serious practical obstacle to 
finding a way. It is that there is a serious shortage of biostatisticians 
in Australia. Attention has been drawn to this in the context of 
epidemiology by John Carlin. 18 A Biostatistics Collaboration of 
Australia has been formed. It has urged the Department of Health 
and Aged Care to set aside, within the Public Health Education and 
Research Program (PHERP), funds for the development of a three-
tier biostatistical award structure to develop and deliver by distance
mode a Graduate Certificate, Graduate Diploma or Masters Degree 
in Biostatistics. The Biostatistics Collaboration has also developed 
a curriculum outline for such courses. Information on these matters 
can be obtained at the website http://www.ctc.usyd.edu.au/BCA.

If the laudable initiatives undertaken by the Biostatistics 
Collaboration are successful, at least a start will have been made in 
rectifying the shortage of biostatisticians in Australia. However, two 
problems will remain. One is to assure employment for the diplo- 
mates and graduates in biostatistics within the broad field of bio- 
medical research. The other touches the disciplines within which 
members of APPS work: usually laboratory experimental research, 
whether in humans, animals, tissues or cells. 19 It is not at all certain
that the postgraduate qualifications proposed by the Biostatistics 
Collaboration will lead to an understanding of the experimental 
techniques and designs used in these research disciplines or of the 
corresponding statistical techniques for analysing the results of such 
experiments. Members of APPS and of other societies concerned 
with experimental biology should examine this question. 